Abstract

In sentence modeling and classification,
convolutional neural network approaches
have recently achieved state-of-the-art results,
but all such efforts process word vectors
sequentially and neglect long-distance
dependencies. To combine deep learning
with linguistic structures, we propose
a dependency-based convolution approach,
making use of tree-based n-grams
rather than surface ones, thus utilizing
non-local interactions between words. Our
model improves on sequential baselines on all
four sentiment and question classification
tasks, and achieves the highest published
accuracy on TREC.

1 Introduction 

Convolutional neural networks (CNNs), originally
invented in computer vision (LeCun et al., 1995),
have recently attracted much attention in natural
language processing (NLP) on problems such as
sequence labeling (Collobert et al., 2011), semantic
parsing (Yih et al., 2014), and search query
retrieval (Shen et al., 2014). In particular, recent
work on CNN-based sentence modeling (Kalchbrenner
et al., 2014; Kim, 2014) has achieved excellent,
often state-of-the-art, results on various
classification tasks such as sentiment, subjectivity,
and question-type classification. However, despite
their celebrated success, there remains a major
limitation from the linguistics perspective: CNNs,
being invented on pixel matrices in image processing,
only consider sequential n-grams that are consecutive
on the surface string and neglect long-distance
dependencies, while the latter play an important
role in many linguistic phenomena such as
negation, subordination, and wh-extraction, all of
which might affect the sentiment, subjectivity,
or other categorization of the sentence.

* This work was done at both IBM and CUNY, and was supported in 
part by DARPA FA8750-13-2-0041 (DEFT), and NSF IIS-1449278. We thank 
Yoon Kim for sharing his code, and James Cross and Kai Zhao for discussions. 


Indeed, in the sentiment analysis literature,
researchers have incorporated long-distance
information from syntactic parse trees, but the results
are somewhat inconsistent: some reported small
improvements (Gamon, 2004; Matsumoto et al.,
2005), while others reported otherwise (Dave et al., 2003;
Kudo and Matsumoto, 2004). As a result, syntactic
features have yet to become popular in the
sentiment analysis community. We suspect one
of the reasons for this is data sparsity (according
to our experiments, tree n-grams are significantly
sparser than surface n-grams), but this problem
has largely been alleviated by the recent advances
in word embedding. Can we combine the advantages
of both worlds?

We therefore propose a very simple dependency-based
convolutional neural network (DCNN). Our
model is similar to Kim (2014), but while his
sequential CNNs put a word in its sequential context,
ours considers a word and its parent, grandparent,
great-grandparent, and siblings on the
dependency tree. This way we incorporate long-distance
information that is otherwise unavailable
on the surface string.

Experiments on three classification tasks 
demonstrate the superior performance of our 
DCNNs over the baseline sequential CNNs. In 
particular, our accuracy on the TREC dataset 
outperforms all previously published results 
in the literature, including those with heavy 
hand-engineered features. 

Independently of this work, Mou et al. (2015, 
unpublished) reported related efforts; see Sec. 3.3. 

2 Dependency-based Convolution 

The original CNN, first proposed by LeCun et
al. (1995), applies convolution kernels on a series
of continuous areas of given images, and was
adapted to NLP by Collobert et al. (2011). Following
Kim (2014), one-dimensional convolution
operates the convolution kernel in sequential order
in Equation 1, where $\mathbf{x}_i \in \mathbb{R}^d$ represents the
d-dimensional word representation for the i-th word in



Example sentence: “Despite the film 's shortcomings the stories are quietly moving”

Figure 1: Dependency tree of an example sentence from the Movie Reviews dataset.


the sentence, and $\oplus$ is the concatenation operator.
Therefore $\mathbf{x}_{i,j}$ refers to the concatenated word vector
from the i-th word to the (i + j)-th word:

$\mathbf{x}_{i,j} = \mathbf{x}_i \oplus \mathbf{x}_{i+1} \oplus \cdots \oplus \mathbf{x}_{i+j}$    (1)

The sequential word concatenation $\mathbf{x}_{i,j}$ works like
an n-gram model, feeding local information into the
convolution operations. However, this setting cannot
capture long-distance relationships unless we
enlarge the window indefinitely, which would
inevitably cause a data sparsity problem.
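To make Equation 1 concrete, the following is a minimal sketch of sequential window concatenation. The function name `sequential_windows` and the use of NumPy are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def sequential_windows(X, j):
    """Build x_{i,j} = x_i (+) x_{i+1} (+) ... (+) x_{i+j} for every valid i.

    X is an (n_words, d) matrix of word vectors; the result has one row per
    window, each of length (j + 1) * d. Hypothetical helper for illustration.
    """
    n, d = X.shape
    if n <= j:
        # The sentence is too short to hold a single window of j + 1 words.
        return np.empty((0, (j + 1) * d))
    return np.stack([X[i:i + j + 1].reshape(-1) for i in range(n - j)])
```

Each row only ever covers words that are adjacent on the surface string, which is exactly the limitation discussed above: widening the window is the only way to relate distant words.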

In order to capture long-distance dependencies,
we propose the dependency-based convolution
model (DCNN). Figure 1 illustrates an example
from the Movie Reviews (MR) dataset (Pang
and Lee, 2005). The sentiment of this sentence
is obviously positive, but this is quite difficult for
sequential CNNs because many n-gram windows
would include the highly negative word “shortcomings”,
and the distance between “Despite” and
“shortcomings” is quite long. DCNN, however,
could capture the tree-based bigram “Despite -
shortcomings”, thus flipping the sentiment, and
the tree-based trigram “ROOT - moving - stories”,
which is highly positive.

2.1 Convolution on Ancestor Paths

We define our concatenation based on the dependency
tree for a given modifier $\mathbf{x}_i$:

$\mathbf{x}_{i,k} = \mathbf{x}_i \oplus \mathbf{x}_{p(i)} \oplus \cdots \oplus \mathbf{x}_{p^{k-1}(i)}$    (2)

where function $p^k(i)$ returns the i-th word's k-th
ancestor index, which is recursively defined as:

$p^k(i) = \begin{cases} p(p^{k-1}(i)) & \text{if } k > 0 \\ i & \text{if } k = 0 \end{cases}$    (3)

Figure 2 (left) illustrates ancestor path patterns
with various orders. We always start the convolution
with $\mathbf{x}_i$ and concatenate it with its ancestors.
If the root node is reached, we add “ROOT” as
dummy ancestors (vertical padding).
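The ancestor path of Equation 2, including the vertical padding, can be sketched as follows. The encoding is an assumption made for illustration: `heads[i]` holds the parent index of word i, and the hypothetical constant `ROOT = -1` stands for the dummy “ROOT” node.

```python
ROOT = -1  # hypothetical index standing for the dummy "ROOT" node

def ancestor_path(heads, i, k):
    """Return the indices [i, p(i), ..., p^{k-1}(i)] used in Equation 2.

    heads[n] is the parent index of word n, with ROOT for the root word.
    Once the root is passed, the path is padded with ROOT (vertical padding).
    """
    path = []
    node = i
    for _ in range(k):
        path.append(node)
        node = heads[node] if node != ROOT else ROOT
    return path
```

For a toy fragment “the stories are moving” with assumed heads `[1, 3, 3, ROOT]`, `ancestor_path(heads, 1, 3)` gives `[1, 3, -1]`, i.e. the tree-based trigram “stories - moving - ROOT”.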

For a given tree-based concatenated word sequence
$\mathbf{x}_{i,k}$, the convolution operation applies a
filter $\mathbf{w} \in \mathbb{R}^{k \times d}$ to $\mathbf{x}_{i,k}$ with a bias term b, as
described in Equation 4:

$c_i = f(\mathbf{w} \cdot \mathbf{x}_{i,k} + b)$    (4)

where f is a non-linear activation function such as
the rectified linear unit (ReLU) or the sigmoid function.
The filter $\mathbf{w}$ is applied to each word in the sentence,
generating the feature map $\mathbf{c} \in \mathbb{R}^l$:

$\mathbf{c} = [c_1, c_2, \cdots, c_l]$    (5)

where l is the length of the sentence.
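As a hedged illustration of Equations 4 and 5, the sketch below computes one feature map with a single filter over ancestor paths, using ReLU as f. The names `tree_feature_map` and `root_vec` (the vector used for the dummy ROOT node) and the head encoding (parent index, -1 for the root word) are assumptions for this example only.

```python
import numpy as np

def tree_feature_map(X, heads, w, b, root_vec):
    """Compute c_i = ReLU(w . x_{i,k} + b) for every word i (Equations 4-5).

    X: (l, d) word vectors; heads[i]: parent index, -1 for the root word;
    w: (k, d) filter; root_vec: (d,) vector standing in for the ROOT padding.
    """
    k, d = w.shape
    c = np.empty(len(X))
    for i in range(len(X)):
        node, rows = i, []
        for _ in range(k):
            # Take the word itself, then its ancestors; pad with ROOT at the top.
            rows.append(root_vec if node < 0 else X[node])
            node = heads[node] if node >= 0 else -1
        c[i] = max(0.0, float(np.sum(w * np.stack(rows)) + b))
    return c
```

The feature map has one entry per word, so its length equals the sentence length l regardless of the filter height k.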

2.2 Max-Over-Tree Pooling and Dropout 

Convolving the filters with the different word
concatenations in Eq. 4 can be regarded as pattern
detection: only the pattern most similar to the
filter returns the maximum activation.
In sequential CNNs, max-over-time pooling
(Collobert et al., 2011; Kim, 2014) operates
over the feature map to get the maximum activation
$\hat{c} = \max \mathbf{c}$ representing the entire feature
map. Our DCNNs also pool the maximum activation
from the feature map to detect the strongest
activation over the whole tree (i.e., over the whole
sentence). Since the tree no longer defines a
sequential “time” direction, we refer to our pooling
as “max-over-tree” pooling.

In order to capture enough variation, we randomly
initialize the set of filters to detect different
structural patterns. Each filter's height is the number
of words considered and its width is always
equal to the dimensionality d of the word representation.
Each filter is represented by only one
feature after max-over-tree pooling. After a series
of convolutions with different filters of different
heights, the multiple features carrying different structural
information become the final representation of the
input sentence. This sentence representation
is then passed to a fully connected soft-max layer, which
outputs a distribution over the labels.
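The pooling and classification steps can be sketched as below, assuming the feature maps of all filters for one sentence are stacked as rows of a matrix. The function name `sentence_distribution` and the parameter names `W` and `bias` for the soft-max layer are hypothetical.

```python
import numpy as np

def sentence_distribution(feature_maps, W, bias):
    """Max-over-tree pooling followed by a soft-max layer.

    feature_maps: (n_filters, l) array, one feature map per filter;
    W: (n_labels, n_filters) weights; bias: (n_labels,) offsets.
    Returns the pooled sentence representation and the label distribution.
    """
    feature_maps = np.asarray(feature_maps, dtype=float)
    pooled = feature_maps.max(axis=1)      # one feature per filter
    logits = W @ pooled + bias
    logits = logits - logits.max()         # shift for numerical stability
    probs = np.exp(logits)
    return pooled, probs / probs.sum()
```

Because each filter contributes exactly one pooled value, the sentence representation has a fixed size even though sentences differ in length.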

Neural networks often suffer from overtraining.
Following Kim (2014), we employ random
dropout on the penultimate layer (Hinton et al., 2014)
in order to prevent co-adaptation of hidden units.
In our experiments, we set our dropout rate to 0.5
and learning rate to 0.95 by default. Following
Kim (2014), training is done through stochastic
gradient descent over shuffled mini-batches with
the Adadelta update rule (Zeiler, 2012).
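For readers unfamiliar with dropout, a minimal sketch with rate 0.5 follows. It uses the common "inverted" variant that rescales the kept units during training; this is an assumption made for illustration and is not necessarily how the implementation used here handles the scaling.

```python
import numpy as np

def dropout(h, rate=0.5, train=True, rng=None):
    """Randomly zero a fraction `rate` of the penultimate-layer units.

    Inverted dropout (an assumed variant): kept units are scaled by
    1 / (1 - rate) in training, so no rescaling is needed at test time.
    """
    if not train or rate == 0.0:
        return h
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(h.shape) >= rate   # True for units that are kept
    return h * mask / (1.0 - rate)
```

At test time (`train=False`) the representation is passed through unchanged.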
Figure 2: Convolution patterns on trees. Word concatenation always starts with m, while h, g, and
denote parent, grand parent, and great-grand parent, etc., and denotes words excluded in convolution.


2.3 Convolution on Siblings 

Ancestor paths alone are not enough to capture
many linguistic phenomena such as conjunction.
Inspired by higher-order dependency parsing (McDonald
and Pereira, 2006; Koo and Collins, 2010),
we also incorporate siblings for a given word in
various ways. See Figure 2 (right) for details.

2.4 Combined Model 

Powerful as it is, structural information still does
not fully cover sequential information. Also, parsing
errors (which are common especially for informal
text such as online reviews) directly affect
DCNN performance, while sequential n-grams are
always correctly observed. To best exploit both kinds of
information, we want to combine both models. The
easiest way to combine them is to concatenate their
representations and feed the result into a fully connected
soft-max layer. In these cases,
combining different features from different types
of sources could stabilize the performance. The
final sentence representation is thus:
$\hat{\mathbf{x}} = [\hat{c}_a^{(1)}, \ldots, \hat{c}_a^{(N_a)}; \hat{c}_s^{(1)}, \ldots, \hat{c}_s^{(N_s)}; \hat{c}^{(1)}, \ldots, \hat{c}^{(N)}]$

where $N_a$, $N_s$, and $N$ are the numbers of ancestor,
sibling, and sequential filters. In practice, we use
100 filters for each template in Figure 2. The fully
combined representation is 1,100-dimensional, in
contrast to 300-dimensional for the sequential CNN.
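The dimensionalities quoted above can be checked with a short sketch. The 300 sequential features correspond to three window sizes (3 to 5) with 100 filters each; the count of eight tree-based templates is an inference from 1,100 - 300 = 800 and is an assumption here, not a figure stated in the text.

```python
import numpy as np

FILTERS_PER_TEMPLATE = 100   # stated in the text
SEQUENTIAL_TEMPLATES = 3     # window sizes 3, 4, and 5
TREE_TEMPLATES = 8           # assumption, inferred from (1100 - 300) / 100

# Placeholder pooled features; real values would come from max pooling.
sequential = np.zeros(SEQUENTIAL_TEMPLATES * FILTERS_PER_TEMPLATE)
tree = np.zeros(TREE_TEMPLATES * FILTERS_PER_TEMPLATE)

# The combined sentence representation is a plain concatenation.
combined = np.concatenate([tree, sequential])
```

Here `sequential.shape` is `(300,)` and `combined.shape` is `(1100,)`, matching the sizes reported for the sequential CNN and the fully combined model.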

3 Experiments 

Table 1 summarizes results in the context of other
high-performing efforts in the literature. We use
three benchmark datasets in two categories: sentiment
analysis on both the Movie Review (MR) (Pang
and Lee, 2005) and Stanford Sentiment Treebank
(SST-1) (Socher et al., 2013) datasets, and question
classification on TREC (Li and Roth, 2002).


For all datasets, we first obtain the dependency
parse tree from the Stanford parser (Manning et al.,
2014).^1 The window sizes for the different choices
of convolution are shown in Figure 2. For the
dataset without a development set (MR), we randomly
hold out 10% of the training data to determine
early stopping. In order to have a fair comparison
with the baseline CNN, we also use 3 to 5 as our
window sizes. Most of our results are generated on
GPU due to its efficiency; however, CPU could
potentially get better results.^2 Our implementation,
on top of Kim (2014)'s code,^3 will be released.^4

3.1 Sentiment Analysis 

Both sentiment analysis datasets (MR and SST-1)
are based on movie reviews. The differences
between them are mainly in the number of
categories and whether a standard split
is given. There are 10,662 sentences in the MR
dataset. Each instance is labeled positive or negative,
and in most cases contains one sentence.
Since no standard data split is given, following the
literature we use 10-fold cross validation to include
every sentence in training and testing at least once.
Concatenating with sibling and sequential information
clearly improves DCNNs, and the final
model outperforms the baseline sequential CNNs
by 0.4, and ties with Zhu et al. (2015).
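The 10-fold protocol for MR can be sketched as follows. The function name `ten_fold_indices`, the fixed seed, and the use of a random permutation are illustrative assumptions; the actual fold assignment used in the experiments is not specified here.

```python
import numpy as np

def ten_fold_indices(n, folds=10, seed=0):
    """Yield (train, test) index arrays so each example is tested exactly once."""
    order = np.random.default_rng(seed).permutation(n)
    parts = np.array_split(order, folds)
    for f in range(folds):
        test = parts[f]
        train = np.concatenate([parts[g] for g in range(folds) if g != f])
        yield train, test
```

With `n = 10662`, each fold holds about 1,066 test sentences; a further 10% of each training portion would then be held out for early stopping, as described in the setup above.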

Different from MR, the Stanford Sentiment
Treebank (SST-1) annotates finer-grained labels
(very positive, positive, neutral, negative, and very
negative) on an extension of the MR dataset. There
are 11,855 sentences with a standard split. Our
model achieves an accuracy of 49.5, which is second
only to Irsoy and Cardie (2014).

^1 The phrase-structure trees in SST-1 are actually automatically parsed,
and thus cannot be used as gold-standard trees.
^2 GPU only supports float32 while CPU supports float64.
^3 https://github.com/yoonkim/CNN_sentence
Category             Model                                           MR      SST-1   TREC    TREC-2
This work            DCNNs: ancestor                                 80.4†   47.7†   95.4†   88.4†
                     DCNNs: ancestor+sibling                         81.7†   48.3†   95.6†   89.0†
                     DCNNs: ancestor+sibling+sequential              81.9    49.5    95.4†   88.8†
CNNs                 CNNs-non-static (Kim, 2014) - baseline          81.5    48.0    93.6    86.4*
                     CNNs-multichannel (Kim, 2014)                   81.1    47.4    92.2    86.0*
                     Deep CNNs (Kalchbrenner et al., 2014)           -       48.5    93.0    -
Recursive NNs        Recursive Autoencoder (Socher et al., 2011)     77.7    43.2    -       -
                     Recursive Neural Tensor (Socher et al., 2013)   -       45.7    -       -
                     Deep Recursive NNs (Irsoy and Cardie, 2014)     -       49.8    -       -
Recurrent NNs        LSTM on tree (Zhu et al., 2015)                 81.9    48.0    -       -
Other deep learning  Paragraph-Vec (Le and Mikolov, 2014)            -       48.7    -       -
Hand-coded rules     SVM_S (Silva et al., 2011)                      -       -       95.0    90.8

Table 1: Results on Movie Review (MR), Stanford Sentiment Treebank (SST-1), and TREC datasets.
TREC-2 is TREC with fine-grained labels. †Results generated by GPU (all others generated by CPU).
*Results generated from Kim (2014)'s implementation.


3.2 Question Classification 

In the TREC dataset, the entire dataset of 5,952
sentences is classified into the following 6 categories:
abbreviation, entity, description, human, location,
and numeric. In this experiment, DCNNs easily
outperform all other methods even with ancestor
convolution only. DCNNs with sibling achieve the
best performance in the published literature. DCNNs
combined with sibling and sequential information
might suffer from overfitting on the training
data, based on our observation. One thing
to note here is that our best result even exceeds
SVMs (Silva et al., 2011) with 60 hand-coded
rules.

The TREC dataset also provides subcategories
such as numeric:temperature, numeric:distance,
and entity:vehicle. To make our task more realistic
and challenging, we also test the proposed
model with respect to the 50 subcategories. The
last column of Table 1 shows clear improvements
over sequential CNNs. Like ours, Silva
et al. (2011) is a tree-based system, but it uses
constituency trees whereas ours uses dependency
trees. They report a higher fine-grained accuracy
of 90.8, but their parser is trained only on the
QuestionBank (Judge et al., 2006) while we used the
standard Stanford parser trained on both the Penn
Treebank and QuestionBank. Moreover, as mentioned
above, their approach is rule-based while
ours is automatically learned.

3.3 Discussions and Examples 

Compared with sentiment analysis, the advantage
of our proposed model is obviously more substantial
on the TREC dataset. Based on our error analysis,
we conclude that this is mainly due to the


(a) What is Hawaii 's state flower ?   (enty => loc)
(b) What is natural gas composed of ?   (enty => desc)
(c) What does a defibrillator do ?   (desc => enty)
(d) Nothing plot_wise is worth emailing home about   (mild negative => mild positive)
(e) What is the temperature at the center of the earth ?   (NUM:temp => NUM:dist)
(f) What were Christopher Columbus ' three ships ?   (ENTY:veh => LOC:other)

Figure 3: Examples from TREC (a-c), SST-1 (d)
and TREC with fine-grained labels (e-f) that are
misclassified by the baseline CNN but correctly
labeled by our DCNN. For example, (a) should be
entity but is labeled location by CNN.




(a) What is the speed hummingbirds fly ?   (“fly” tagged as noun; num => enty)
(b) What body of water are the Canary Islands in ?   (loc => enty)
(c) What position did Willie Davis play in baseball ?   (hum => enty)

Figure 4: Examples from TREC datasets that are
misclassified by DCNN but correctly labeled by
baseline CNN. For example, (a) should be numerical
but is labeled entity by DCNN.

difference in parse tree quality between the
two tasks. In sentiment analysis, the dataset is
collected from the Rotten Tomatoes website, which
includes many irregular usages of language. Some
of the sentences even come from languages other
than English. The errors in parse trees inevitably
affect the classification accuracy. However, the
parser works substantially better on the TREC
dataset since all questions are in formal written
English, and the training set for the Stanford parser^5
already includes the QuestionBank (Judge et al.,
2006), which includes 2,000 TREC sentences.

Figure 3 visualizes examples where CNN errs
while DCNN does not. For example, CNN labels
(a) as location due to “Hawaii” and “state”,
while the long-distance backbone “What - flower”
is clearly asking for an entity. Similarly, in (d),
DCNN captures the obviously negative tree-based
trigram “Nothing - worth - emailing”. Note that
our model also works with non-projective dependency
trees such as the one in (b). The last two
examples in Figure 3 visualize cases where DCNN
outperforms the baseline CNNs in fine-grained
TREC. In example (e), the word “temperature” is
second from the top and is the root of an 8-word span
“the ... earth”. When we use a window of size 5
for tree convolution, every word in that span gets
convolved with “temperature”, and this should be
the reason why DCNN gets it correct.

Figure 4 showcases examples where baseline
CNNs get better results than DCNNs. Example
(a) is misclassified as entity by DCNN due to a
parsing/tagging error (the Stanford parser performs its

^5 http://nlp.stanford.edu/software/parser-faq.shtml


(a) What is the melting point of copper ?   (num => enty and desc)
(b) What did Jesse Jackson organize ?   (hum => enty and enty)
(c) What is the electrical output in Madrid , Spain ?   (enty => num and num)

Figure 5: Examples from TREC datasets that are
misclassified by both DCNN and baseline CNN.
For example, (a) should be numerical but is labeled
entity by DCNN and description by CNN.

own part-of-speech tagging). The word “fly” at
the end of the sentence should be a verb instead of
a noun, and “hummingbirds fly” should be a relative
clause modifying “speed”.

There are some sentences that are misclassified
by both the baseline CNN and DCNN. Figure 5
shows three such examples. Example (a) is not
classified as numerical by either method due to the
ambiguous meaning of the word “point”, which is
difficult to capture by word embedding. This word
can mean location, opinion, etc. Apparently, the
numerical aspect is not captured by word embedding.
Example (c) might be an annotation error.

Shortly before submitting to ACL 2015 we
learned that Mou et al. (2015, unpublished) have
independently reported concurrent and related efforts.
Their constituency model, based on their unpublished
work in programming languages (Mou et
al., 2014),^6 performs convolution on pretrained
recursive node representations rather than word
embeddings, thus bearing little, if any, resemblance to
our dependency-based model. Their dependency
model is related, but always includes a node and
all its children (resembling Iyyer et al. (2014)),
which is a variant of our sibling model and always
flat. By contrast, our ancestor model looks at the
vertical path from any word to its ancestors, being
linguistically motivated (Shen et al., 2008).

4 Conclusions 

We have presented a very simple dependency- 
based convolution framework which outperforms 
sequential CNN baselines on modeling sentences. 

^6 Both their 2014 and 2015 reports proposed (independently of each other
and independently of our work) the term “tree-based convolution” (TBCNN).